First, load the data, then get the sample size with the nrow() function.
wine <- read.csv("data/winequality-red.csv", sep = ";")
sample_size <- nrow(wine)
print(paste('sample size is ', sample_size))
## [1] "sample size is 1599"
Draw a plot for each of the variables:
for (col_name in colnames(wine)) {
  plot(wine[[col_name]], main = paste("distribution of ", col_name))
}
There appear to be some outliers in the total.sulfur.dioxide variable.
The summary() function reports the minimum, first quartile, median, mean, third quartile, and maximum of each variable:
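One way to back up this visual impression is the 1.5 × IQR rule that boxplots use. The sketch below is only an illustration and not part of the original analysis: it runs on a small made-up vector `x` rather than the wine data, and the helper name `count_outliers` is hypothetical.

```r
# Hypothetical helper: count the values that fall beyond the boxplot
# whiskers (more than 1.5 * IQR outside the hinges), using boxplot.stats().
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up stand-in data; with the real data one would pass
# wine$total.sulfur.dioxide instead.
x <- c(6, 22, 30, 38, 45, 62, 70, 289)
n_out <- count_outliers(x)
print(n_out)
```

Applied to every column, for example with sapply(wine, count_outliers), this would give a rough per-variable count of flagged points, assuming the 1.5 × IQR rule is an acceptable definition of "outlier" here.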
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
In addition, the standard deviation of each variable gives a little more insight into its spread:
for (col_name in colnames(wine)) {
  # Use a name other than sd so the sd() function is not masked
  col_sd <- sd(wine[[col_name]])
  print(paste("standard deviation of ", col_name, ": ", round(col_sd, 2)))
}
## [1] "standard deviation of fixed.acidity : 1.74"
## [1] "standard deviation of volatile.acidity : 0.18"
## [1] "standard deviation of citric.acid : 0.19"
## [1] "standard deviation of residual.sugar : 1.41"
## [1] "standard deviation of chlorides : 0.05"
## [1] "standard deviation of free.sulfur.dioxide : 10.46"
## [1] "standard deviation of total.sulfur.dioxide : 32.9"
## [1] "standard deviation of density : 0"
## [1] "standard deviation of pH : 0.15"
## [1] "standard deviation of sulphates : 0.17"
## [1] "standard deviation of alcohol : 1.07"
## [1] "standard deviation of quality : 0.81"
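The value 0 printed for density is an effect of rounding to two decimals, since density varies only in the third and fourth decimal place (see the summary above). A minimal sketch of one alternative, assuming significant digits are acceptable in place of fixed decimals, is shown below. The data frame `toy` is a made-up stand-in with invented values, not the wine data.

```r
# Made-up stand-in for the wine data frame (two columns only).
toy <- data.frame(
  density = c(0.9968, 0.9970, 0.9978, 0.9956),
  alcohol = c(9.4, 9.8, 10.2, 11.1)
)

# sapply() applies sd() to every column; signif() keeps three
# significant digits instead of a fixed number of decimals.
col_sds <- signif(sapply(toy, sd), 3)
print(col_sds)
```

With the real data, signif(sapply(wine, sd), 3) would replace the loop and keep small standard deviations visible.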
Draw a histogram of each variable with hist() function, and draw a density curve on top of it.
for (col_name in colnames(wine)) {
  hist(wine[[col_name]], main = col_name, freq = FALSE)
  lines(density(wine[[col_name]]), lwd = 5, col = "blue")
}
Yes, many variables, such as citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, and alcohol, are skewed.
The author discussed linear/multiple regression (MR), neural networks (NN), and support vector machines (SVM). MR can be seen as a reduced form of NN with no layer of hidden nodes. The empirical results show that SVM outperformed NN (and also MR) in this case study, especially for white wine.
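The visual judgement of skewness can be complemented with a number. Base R has no built-in skewness function, so the sketch below defines a hypothetical helper `skewness` using the moment-based definition (mean cubed deviation divided by the cubed population standard deviation). The two vectors are made-up examples, not the wine data.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail, negative a longer left tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up example vectors.
right_skewed <- c(1, 1, 2, 2, 3, 4, 20)
symmetric <- c(1, 2, 3, 4, 5)

print(skewness(right_skewed))
print(skewness(symmetric))
```

With the real data, sapply(wine, skewness) would give one value per variable. Values clearly above 0 would be consistent with the right-skewed histograms.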